20180608 A more user-friendly way to recommend albums
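In the earlier code we learned how to filter albums by rating and genre.
This time I have a friendlier way to do the same thing:
type in an artist you like and get three albums recommended at random.
The code itself uses no new techniques.
The flow is roughly as follows: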
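1. Crawl the artist's URL
2. Crawl the artist's music styles
3. Build the style dictionary
4. Convert the styles collected earlier into the headers and data needed for the later POST request
5. Use that data (styles and ratings) to crawl all matching albums and add them to a list
6. Randomly pick 3 albums from the list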
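Running all of these steps makes the program quite slow.
Later I may add loading from a saved file (import file) or asynchronous requests (asyncio) to build a more advanced crawler.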
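As one possible reading of the "import file" idea, here is a minimal sketch of caching a crawled dictionary in a JSON file so that later runs can skip a request. The helper name `load_or_build`, the file name, and the stand-in builder are all hypothetical; none of them come from the original program.

```python
import json
import os
import tempfile


def load_or_build(path, builder):
    """Return the dict cached at `path`, or build it once and save it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = builder()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return data


calls = []


def fake_builder():
    # Stand-in for the slow crawl; a real builder would scrape the site.
    calls.append(1)
    return {"Surf": "subgenreid:MA0000002883"}


cache_path = os.path.join(tempfile.mkdtemp(), "label_dict.json")
first = load_or_build(cache_path, fake_builder)   # builds and writes the file
second = load_or_build(cache_path, fake_builder)  # reads the file, no rebuild
```

The style dictionary from step 3 changes rarely, so it is a natural candidate for this kind of cache.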
1. Crawl the artist's URL
# Import modules
from seleniumrequests import Chrome
import requests
import re
from bs4 import BeautifulSoup
import random
# Enter the artist to look up and pass the keyword straight to the AMG artist search
artist_input = input('type artist you like: ')
search_url = "https://www.allmusic.com/search/artists/" + artist_input
# Crawl search_url with seleniumrequests and pull out the URL of the matching artist
chrome_path = r"C:\Users\Ramone\seleniumdriver\chrome\chromedriver.exe" # local path to the browser driver
webdriver = Chrome(chrome_path) # use Chrome as the webdriver
search_res = webdriver.request('GET',search_url)
search_soup = BeautifulSoup(search_res.text,'lxml')
search_source = search_soup.find('div',{'class':'name'})
artist_url = re.search(re.compile(r'https://www.allmusic.com/artist/.*(?<=\d)'),str(search_source)).group()
print (artist_url)
###output###
#https://www.allmusic.com/artist/the-beach-boys-mn0000041874
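The search step concatenates the raw input into the URL and relies on a greedy `.*` to find the artist link. Below is a minimal offline sketch of both points. The HTML snippet is made up for illustration, and the stricter pattern assumes artist IDs always look like `-mn` followed by digits, as in the sample output. In the original code the regex only sees the first `div` with class `name`, so running past the link is less likely there than in this contrived sample.

```python
import re
from urllib.parse import quote

# Percent-encode the input so spaces and symbols are safe in the URL path.
artist_input = "the beach boys"
search_url = "https://www.allmusic.com/search/artists/" + quote(artist_input)

# Made-up markup: the artist link is followed by other text containing digits.
sample_html = (
    '<div class="name"><a href="https://www.allmusic.com/artist/'
    'the-beach-boys-mn0000041874">The Beach Boys</a></div>'
    '<div class="decades">1960s</div>'
)

# Pattern from the post: greedy, so it runs on to the last digit in the text.
greedy = re.search(r'https://www.allmusic.com/artist/.*(?<=\d)', sample_html).group()

# Stricter pattern (assumption: IDs have the form "-mn<digits>").
strict = re.search(r'https://www\.allmusic\.com/artist/[\w-]+-mn\d+', sample_html).group()

print(search_url)
print(strict)
```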
2. Use the artist URL just obtained (artist_url) to crawl the artist's styles on AMG
# Use the browser developer tools to find the pattern of the style links and hand it to BeautifulSoup
webdriver = Chrome(chrome_path) # optional, the earlier webdriver can be reused
artist_res = webdriver.request('GET',artist_url)
artist_soup = BeautifulSoup(artist_res.text,'lxml')
artist_source = artist_soup.find_all('a',{'href':re.compile(r'https://www.allmusic.com/style/.*')})
# Add the results to a list for later use
style_list=[]
for s in artist_source:
    style_list.append(s.text)
    print (s.text)
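To try the href filter without a browser or BeautifulSoup, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not what the post uses, and the markup, the class name `StyleLinkParser`, and the link slugs are made up for illustration.

```python
import re
from html.parser import HTMLParser

STYLE_HREF = re.compile(r'https://www\.allmusic\.com/style/.*')


class StyleLinkParser(HTMLParser):
    """Collect the text of every <a> whose href matches STYLE_HREF."""

    def __init__(self):
        super().__init__()
        self.styles = []
        self._in_style_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and STYLE_HREF.match(href):
            self._in_style_link = True
            self.styles.append("")

    def handle_data(self, data):
        # Text may arrive in pieces, so append to the current entry.
        if self._in_style_link:
            self.styles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_style_link = False


# Made-up snippet: two style links with an artist link in between.
sample_html = (
    '<a href="https://www.allmusic.com/style/surf-ma0000002883">Surf</a>'
    '<a href="https://www.allmusic.com/artist/x-mn0000000000">Not a style</a>'
    '<a href="https://www.allmusic.com/style/am-pop-ma0000012000">AM Pop</a>'
)
parser = StyleLinkParser()
parser.feed(sample_html)
parser.close()
style_list = parser.styles
print(style_list)
```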
3. Next, crawl AMG's codes for the music styles and build the dictionaries (same code as in the previous post)
webdriver = Chrome(chrome_path) # use Chrome as the webdriver
# Build the label dictionary (style name -> AMG filter code)
label_res= webdriver.request('GET','https://www.allmusic.com/advanced-search/')
label_soup = BeautifulSoup(label_res.text,"lxml")
label = label_soup.find_all('input',{'id':re.compile('genreid.*?')})
label_dict={}
for l in label:
    label_dict[l['value']]=l['id']
# Build the rating dictionary: 1.0 to 5.0 stars in half-star steps map to codes 1 to 9
rating_dict={}
star=1.0
for i in range(1,10):
    rating_dict[str(star)]='editorialrating:'+str(i)
    star+=0.5
# Print the code for each of the artist's styles
for i in style_list:
    print (i,' : ',label_dict[i])
###output###
#AM Pop : subgenreid:MA0000012000
#Early Pop/Rock : subgenreid:MA0000002763
#Surf : subgenreid:MA0000002883
#Contemporary Pop/Rock : subgenreid:MA0000004443
#Sunshine Pop : subgenreid:MA0000012028
#Psychedelic Pop : subgenreid:MA0000011915
#Rock & Roll : subgenreid:MA0000002829
#Psychedelic/Garage : subgenreid:MA0000002800
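The rating loop maps star values from 1.0 to 5.0, in half-star steps, onto the codes `editorialrating:1` through `editorialrating:9`. As a small sketch, the same mapping can be built without accumulating a float. That this is the full range AMG uses is implied by the loop above rather than stated, so treat it as an assumption.

```python
# Count in half-stars: 2 half-stars = 1.0 star, ..., 10 half-stars = 5.0 stars.
rating_dict = {
    str(half / 2): "editorialrating:" + str(half - 1)
    for half in range(2, 11)
}

print(rating_dict["5.0"])
print(rating_dict["3.5"])
```

With this mapping, 3.5 stars corresponds to `editorialrating:6`, which is the code step 5 adds when it relaxes the rating filter.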
4. Build the headers and data needed for the POST.
The headers are hard-coded; the data is built from the style_list collected above.
# Fixed AMG headers
amg_header={
'authority': r'www.allmusic.com',
'method': r'POST',
'path': r'/advanced-search/results/',
'scheme': r'https',
'accept': r'text/html, */*; q=0.01',
'accept-encoding': r'gzip, deflate, br',
'accept-language': r'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6',
'content-length': '185',
'content-type': r'application/x-www-form-urlencoded; charset=UTF-8',
'cookie': r'_ga=GA1.2.85029673.1513518205; __gads=ID=3275e8321c618a22:T=1513518176:S=ALNI_MbT7eOHrtfYxgOBlXi-4NZzwkA01Q; __qca=P0-704611526-1513518207321; policy=notified; _gid=GA1.2.15574113.1527937083; bm_monthly_unique=true; registration_prompt=true; bm_last_load_status=BLOCKING; advancedSearchLogic=and; allmusic_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22f55df85c5fd33a3550642ff7f525e829%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A11%3A%2210.128.8.31%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A114%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F67.0.3396.62+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1528081392%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22user%22%3Bi%3A0%3B%7D6e67ed658981abc1f6d25c605bcee246; _gat=1',
'origin': r'https://www.allmusic.com',
'referer': r'https://www.allmusic.com/advanced-search',
'user-agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
'x-requested-with': r'XMLHttpRequest',
}
# Build the data from the conditions gathered earlier
# To keep the album pool broad, pick only 3 styles at random when there are too many (intersecting many styles leaves too few results)
if len(style_list) > 3:
    amg_style = random.sample(style_list,3)
else:
    amg_style = style_list
# Build the style part of the POST data
# The labels are joined with %26, the URL-encoded "&": the body is sent as a raw string, so a literal "&" would be read as a form-field separator
amg_label="%26".join(label_dict[i] for i in amg_style)
# Build the rating part of the POST data
amg_rating=rating_dict['5.0']+'|'+rating_dict['4.5']+'|'+rating_dict['4.0']
print (amg_label)
print (amg_rating)
print (amg_style)
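###output### (four styles appear here although the code above samples three, so this run presumably predates that limit)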
#subgenreid:MA0000002800%26subgenreid:MA0000011915%26subgenreid:MA0000002829%26subgenreid:MA0000012028
#editorialrating:9|editorialrating:8|editorialrating:7
#['Psychedelic/Garage', 'Psychedelic Pop', 'Rock & Roll', 'Sunshine Pop']
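To see why the labels are joined with `%26` rather than a literal `&`, it helps to decode the request body the way a form parser would. The sketch below runs the standard library's `parse_qs` on the same `filters[]=%s&filters[]=%s` template that step 5 posts. How AMG's server actually parses the body is an assumption; the label values are taken from the sample output.

```python
from urllib.parse import parse_qs

labels = ["subgenreid:MA0000002800", "subgenreid:MA0000011915"]
rating = "editorialrating:9|editorialrating:8|editorialrating:7"
template = "filters[]=%s&filters[]=%s"

encoded_body = template % ("%26".join(labels), rating)
literal_body = template % ("&".join(labels), rating)

# %26 is decoded to "&" inside the first filter value, so both labels survive.
encoded = parse_qs(encoded_body)["filters[]"]
# A literal "&" ends the first field early; the second label has no "key=" part and is dropped.
literal = parse_qs(literal_body)["filters[]"]

print(encoded)
print(literal)
```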
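5. Start crawling albums: POST with the data (the filter conditions), collect every album, and add it to a list named recommend (similar to the code in the previous post)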
# Pass the headers and data to selenium-requests for the POST request
webdriver = Chrome(chrome_path)
res= webdriver.request('POST','https://www.allmusic.com/advanced-search/results/'
    ,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
res_soup = BeautifulSoup(res.text,'lxml')
# Use the developer tools to find the pagination pattern, which is used to build the next request
next_page = res_soup.find_all('span',{'class':'next'})
amg_url = 'https://www.allmusic.com' # no trailing slash: the matched path already starts with "/"
next_url=re.compile(r'(?=/advanced-search/results/)/advanced-search/results/\d+')
recommend = []
# Helper that strips newlines and surrounding whitespace and collects the text into a list
def add_list(y):
    y_list=[]
    for x in y:
        x_text=x.text
        x_str=x_text.strip()
        y_list.append(x_str)
    return y_list
# Helper that pulls the result cells with BeautifulSoup, wrapped in a class
class get_info():
    def __init__(self,artist,title,year):
        self.artist=artist
        self.title=title
        self.year=year
# Reads the module-level res_soup, so it always reflects the most recent response
def get_source():
    artist = res_soup.find_all('td',{'class':'artist'})
    title = res_soup.find_all('td',{'class':'title'})
    year = res_soup.find_all('td',{'class':'year'})
    return get_info(artist,title,year)
# When there are six results or fewer, relax the rating to 3.5 stars and run the POST again.
if len(add_list(get_source().title))<7:
    print ('need more low rating album, please wait...')
    amg_rating='editorialrating:9|editorialrating:8|editorialrating:7|editorialrating:6'
    webdriver = Chrome(chrome_path)
    res= webdriver.request('POST','https://www.allmusic.com/advanced-search/results/'
        ,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
    res_soup = BeautifulSoup(res.text,'lxml')
    # Only the first page of the relaxed results is collected in this branch
    page = get_source()
    for a, t, y in zip(add_list(page.artist), add_list(page.title), add_list(page.year)):
        recommend.append(a+"-"+t+"-"+y)
# Only one page of results (more than six entries)
elif next_page == [] and len(add_list(get_source().title))>6:
    print ('loading...only one page')
    page = get_source()
    for a, t, y in zip(add_list(page.artist), add_list(page.title), add_list(page.year)):
        recommend.append(a+"-"+t+"-"+y)
# Multiple pages of results
else:
    next_res = amg_url+re.search(next_url,str(next_page[-1])).group()
    while next_page != []:
        print ('loading...multi-page' )
        # Collect the current page before requesting the next one
        page = get_source()
        for a, t, y in zip(add_list(page.artist), add_list(page.title), add_list(page.year)):
            recommend.append(a+"-"+t+"-"+y)
        res= webdriver.request('POST',next_res,headers=amg_header,data="filters[]=%s&filters[]=%s" %(amg_label,amg_rating))
        res_soup = BeautifulSoup(res.text,'lxml')
        next_page = res_soup.find_all('span',{'class':'next'})
        if next_page != []:
            next_res = amg_url+re.search(next_url,str(next_page[-1])).group()
        else:
            # No "next" link left: collect the last page, then the loop ends
            print ('loading...last page' )
            page = get_source()
            for a, t, y in zip(add_list(page.artist), add_list(page.title), add_list(page.year)):
                recommend.append(a+"-"+t+"-"+y)
# Print the full list
#for r in recommend:
# print (r)
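The branches in step 5 all do the same thing for each page: read the rows, then follow the "next" link if there is one. That loop can be sketched on its own with a stand-in fetcher. The `fetch` callable, the page dictionaries, and their `rows` and `next` keys are hypothetical and take the place of the real POST request and BeautifulSoup parsing; the rows are placeholder strings.

```python
def collect_all(fetch, first_url):
    """Walk pages starting at first_url until a page has no next link."""
    results = []
    url = first_url
    while url is not None:
        page = fetch(url)              # one request per page
        results.extend(page["rows"])   # rows already formatted as strings
        url = page["next"]             # None on the last page
    return results


# Stand-in for the site: two pages of made-up rows.
fake_pages = {
    "/results/": {"rows": ["A-One-1966", "B-Two-1967"], "next": "/results/2"},
    "/results/2": {"rows": ["C-Three-1968"], "next": None},
}

recommend = collect_all(fake_pages.__getitem__, "/results/")
print(recommend)
```

Written this way, the single-page case is just a walk that stops after one step, so it needs no branch of its own.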
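6. Randomly pick 3 albums from recommend
# min() keeps random.sample from raising ValueError when fewer than 3 albums were found
for r in random.sample(recommend,min(3,len(recommend))):
    print (r)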
Source credit: All Music : https://www.allmusic.com/
No copyright infringement intended.